1. How did you go about extracting features from the raw data?
A Fourier transform to the frequency domain, amplitude normalization, and rolling-average smoothing were applied, and then the statistical features suggested in the assignment were extracted. The spectrum was also divided into two frequency ranges separated at 180 Hz. The peak positions in both ranges (peak_lo, peak_high) were extracted, as well as the peak height ratio (p_ratio) and the same ratio corrected for the noise floor (p_ratio_submin).
I understand that simple statistical features are very crude and don't capture the textural quality of the voices. In hindsight, it would have been better to also extract "adult or not" and language, in addition to the gender info, from the readme file in each folder.
Given more time and a better computer, I would make spectrograms for each individual and run a CNN for the classification.
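As a rough illustration, the extraction steps described above could be sketched as follows. This is my own simplification, not the actual make_data_stats.py implementation; the function name, smoothing window, and the exact set of returned columns are illustrative:

```python
import numpy as np
from scipy import stats

def extract_features(signal, sample_rate=16000, split_hz=180):
    """Sketch of the stats-feature pipeline: FFT -> normalize -> smooth -> stats."""
    # Fourier transform to the frequency domain (positive frequencies only).
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Amplitude normalization.
    spectrum = spectrum / spectrum.max()
    # Rolling-average smoothing (5-bin window, illustrative choice).
    spectrum = np.convolve(spectrum, np.ones(5) / 5, mode="same")
    # Split at split_hz and locate the peak in each range.
    lo = spectrum[freqs <= split_hz]
    hi = spectrum[freqs > split_hz]
    peak_lo = freqs[freqs <= split_hz][lo.argmax()]
    peak_hi = freqs[freqs > split_hz][hi.argmax()]
    return {
        "mean": spectrum.mean(),
        "median": np.median(spectrum),
        "std": spectrum.std(),
        "q1": np.quantile(spectrum, 0.25),
        "q3": np.quantile(spectrum, 0.75),
        "kurtosis": stats.kurtosis(spectrum),
        "peak_lo": peak_lo,
        "peak_hi": peak_hi,
        "p_ratio": lo.max() / hi.max(),
        # Same ratio after subtracting the lowest level of the spectrum.
        "p_ratio_submin": (lo.max() - spectrum.min())
                          / (hi.max() - spectrum.min()),
    }
```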
2. Which features do you believe contain relevant information?
By observing the spectra in the frequency domain, it was easy to notice that males occasionally had a low-frequency peak in addition to a high-frequency peak. This structure would also cause a larger interquartile range and standard deviation. Therefore I believed that peak_lo, peak_high, and p_ratio or p_ratio_submin should be important.
3. How did you decide which features matter most?
Three different importance measures were used: SHAP, Random Forest, and PCA. SHAP and Random Forest importance agree somewhat. Overall, the most important features are: peak_lo (low-frequency peak position), kurtosis, q1 (1st quartile position), q3 (3rd quartile position), peak_hi (high-frequency-range peak position), and p_ratio_submin (peak ratio corrected for background).
4. Do any features contain similar information content?
From the pairplots, it was found that several features are strongly correlated and therefore carry largely the same information; in particular, mean and median track each other closely.
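A quick way to quantify the redundancy seen in the pairplots is to scan the correlation matrix for highly correlated pairs. The helper below is a sketch; the threshold and the column names used in the test are illustrative:

```python
import pandas as pd

def redundant_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(corr.loc[a, b], 2)))
    return pairs
```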
5. Are there any insights about the features that you didn't expect? If so, what are they?
Yes, the kurtosis. Observing the pairplots for data with log-transformed kurtosis, it is clear that males have lower kurtosis than females.
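The log transform mentioned above can be sketched as follows. The shift is my own guard, since kurtosis can be negative (platykurtic distributions) and the log would otherwise be undefined:

```python
import numpy as np
import pandas as pd

def log_transform(series):
    """Log-transform a feature column, shifting it positive first if needed."""
    shift = 1.0 - series.min() if series.min() <= 0 else 0.0
    return np.log(series + shift)
```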
6. Are there any other (potential) issues with the features you've chosen? If so, what are they?
peak_lo and peak_high are very crude for capturing the male vs. female spectral profile. A CNN-derived feature should work better. However, while solving this problem my computer became very slow due to a package installation, so I didn't have enough time to build a CNN model. Working in virtual environments should be enforced on my home computer as well!
7. Which goodness of fit metrics have you chosen, and what do they tell you about the model(s) performance?
The dataset is imbalanced with M:F ≈ 13:1, so classifying everyone as male would already give 92.8% accuracy. Therefore the F1 score was chosen as the metric to maximize. The F1 score is the harmonic mean of precision and recall, defined as follows:
precision: the share of true positives among predicted positives (the ability to exclude the non-positive class).
recall: the share of true positives among actual positives (the ability to reach the positive class).
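A minimal worked example of why accuracy is misleading here. The speaker counts (1400 total, 1300 male, 100 female) are illustrative, chosen to match the roughly 13:1 imbalance:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "Always predict male" baseline, treating female as the positive class:
# it never predicts a positive, so tp = 0, fp = 0, fn = 100.
accuracy = 1300 / 1400                               # ~0.93, looks great
_, _, f1 = precision_recall_f1(tp=0, fp=0, fn=100)   # but F1 is 0.0
```

High accuracy with zero F1 is exactly the failure mode the F1 objective protects against.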
8. Which model performs best?
Random Forest, in this case. It is very robust to data with different types of distributions. The bagging it performs internally reduces the variance that plagues single decision trees, and the random selection of features at each split reduces the bias introduced by a few dominating features.
9. How would you decide between using a more sophisticated model versus a less complicated one?
It depends on explainability, scalability, production compatibility, and the business case. If gaining a few percentage points of accuracy brings no real insight or benefit while wasting time and resources, then a simpler and faster model should be preferred.
10. What kind of benefits do you think your model(s) could have as part of an enterprise application or service?
The speech-gender-determination problem itself may not have many non-creepy practical use cases. However, machine learning models in general give an extra edge to many decision-making processes. In the gaming industry, for example, such models could help identify valuable players for bonus rewards, which used to be decided mostly by gut feeling.
load basic packages
%load_ext autoreload
%autoreload 2
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import os, sys
import pandas as pd
CURR_PATH = os.path.join(os.getcwd(), os.pardir)
sys.path.append(CURR_PATH)
load code module make_data_stats.py and paths
import make_data_stats as mkd
DATA_RAW, PROCESSED, CURR_PATH = mkd.DATA_RAW, mkd.PROCESSED, mkd.CURR_PATH
Automated downloading and file extraction >> Gender info extraction >> Audio data processing >> Noise reduction >> Data observation and visualization >> Build stats features
One can achieve all the steps in the workflow by running make_data_stats.py
python make_data_stats.py 1
project_root:
  - data:
    - raw: raw data, 6247 audio files
    - processed: extracted data like df_stats_summary.csv etc.
  - src: code and notebooks
mkd.download_extract(DATA_RAW)
The extracted folders are stored in DATA_RAW, which is project_root/data/raw.
mkd.df_init()
df_init = pd.read_csv(PROCESSED +os.sep + 'df_init.csv', sep = ';')
df_init.head(5)
The 'id' column is the name of the folder for each person.
# This only works after all the files are downloaded and extracted
mkd.fig_10samples() # the vertical line is 180 Hz
Because male and female voices have different frequency distributions, the statistical features suggested in the assignment are very likely useful.
In addition, the peak positions of the low-frequency range (freq <= 180 Hz) and the high-frequency range (freq >= 180 Hz) were obtained, the ratio between the peak heights was computed, and the same ratio corrected by subtracting the lowest level of the data is included as well. The complete set of feature names and their meanings is as follows:
There are in total 12 features.
The following code achieves steps 3-6 for all rows in df_init:
mkd.get_dfstas(df_init, DATA_RAW, PROCESSED, 0) # dirname information is contained in df_init.id
The following then tries multiprocessing, but it brings no run-time improvement because the essential bottleneck is not the CPU but memory.
mkd.make_data_stats(multiproc=False)
The resulting dataset looks like this:
df = pd.read_csv(PROCESSED + os.sep +'stats_summary.csv')
df.head(5)
load mlproc module
import mlproc as ml
df, data, target = ml.preprocess() # raw data without any transformation
ml.sb.pairplot(df[df.columns[~df.columns.isin(['id'])]], hue = 'gender', kind='scatter', plot_kws={'alpha': 0.5})
After the transformation, many correlations between features can be discovered.
df, data, target = ml.preprocess(transf=True)
ml.sb.pairplot(df[df.columns[~df.columns.isin(['id'])]], hue = 'gender', kind='scatter', plot_kws={'alpha': 0.5})
Data was split into train (90%) and test (10%), stratified according to the distribution of y. The reason to keep most of the data in the training set is to train a better model. Within the training set, a 6-fold cross-validation was used, so each fold trains on roughly 80% of the training data and validates on the rest.
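A sketch of this split strategy in scikit-learn. The synthetic `make_classification` data stands in for the real features (12 features, ~13:1 imbalance); the random seeds are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold

# Synthetic stand-in for the real stats features and gender labels.
data, target = make_classification(
    n_samples=1400, n_features=12, weights=[0.93], random_state=42)

# 90/10 train/test split, stratified on the label distribution.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.10, stratify=target, random_state=42)

# 6-fold stratified CV on the training portion: every fold
# preserves the class ratio of y_train.
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_train, y_train):
    pass  # model fitting would happen here
```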
Eight machine learning algorithms were tried:
The pipeline consisted of 3 steps:
The initial model training was done by:
df, data, target = ml.preprocess()
ml.trial_dataparams(data, target)
cls_init = pd.read_csv(PROCESSED + os.sep + 'trial_dataparams_.csv', sep = ';')
cls_init['Model Name'].unique()
cls_init.head(5)
print('Pipeline')
cls_init.iloc[0]['Pipeline']
The best-performing model from this initial round was Random Forest without class-weight balancing. Log and square transforms of the data changed neither the best model nor its level of performance, and PCA and the choice of scaler also didn't matter very much. So the untransformed data was used for fine-tuning the Random Forest in a pipeline consisting of a StandardScaler and a RandomForestClassifier only.
A grid search maximizing the F1 score was done by:
ml.randomforest_grids(data, target)
The search was hierarchical: first find the best class_weight (found to be {0: 1, 1: 2}) and n_estimators (100 performs just as well as 200), then grid-search max_depth, max_features, and min_samples_leaf.
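A sketch of what this fine-tuning stage could look like in scikit-learn. Only class_weight and n_estimators follow the values stated above; the grid values for the remaining parameters are illustrative, not the ones actually searched:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# StandardScaler + RandomForestClassifier, with the class_weight and
# n_estimators fixed from the first stage of the hierarchical search.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rfc", RandomForestClassifier(
        class_weight={0: 1, 1: 2}, n_estimators=100, random_state=42)),
])

# Second stage: grid over max_depth, max_features, min_samples_leaf,
# scored by F1 as in the text (grid values are illustrative).
param_grid = {
    "rfc__max_depth": [5, 10, None],
    "rfc__max_features": ["sqrt", "log2"],
    "rfc__min_samples_leaf": [1, 2, 4],
}
grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=6, n_jobs=-1)
```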
The F1 score on the CV set improved from 0.686 to 0.702, with a very slight accuracy trade-off: accuracy dropped from 0.960 to 0.958.
rfc_grid = pd.read_csv(PROCESSED + os.sep + 'rfc_grid.csv', sep = ';')
#rfc_grid.columns
best_grid_cv = rfc_grid.sort_values(by = 'mean_test_F1_score', ascending = False)[['params','mean_test_F1_score','mean_test_Accuracy' ]].head(1)
best_grid_cv
print(best_grid_cv.params.values)
The test set F1 score is 0.62.
df_clrepo = pd.read_csv(PROCESSED + os.sep + 'classification_report.csv', sep = ';')
df_clrepo
ml.print_best_model()
Three different importance measures were used: SHAP, Random Forest, and PCA.
SHAP:
https://arxiv.org/pdf/1802.03888.pdf
It has been proven to be consistent and model-agnostic.
Random Forest importance: it may rank features differently when the model changes, because some high-importance features may not have high split counts.
PCA: it only captured the more obvious linear correlations between features and the variance of the data, so it mistakenly marked mean and median as the two most important features.
Both SHAP and Random Forest considered peak_lo to be the most important feature. Therefore the existence of a low-frequency bump in the male spectrum was indeed a deciding factor for male vs. female identification.
The importances obtained by the three methods are illustrated below.
ml.plot_shap_importance(data, target)
ml.plot_bestRF_importance(data)
ml.plot_pca_importance(data, target)